This part builds a baseline topic model. The goal is to fit an LDA model to the news articles and to analyze and interpret the results intuitively. The main elements of the analysis are:
First, the data is cleaned and preprocessed. Once the data is cleaned (see the first section for the details of the implementation), some descriptive statistics are presented: in particular, the number of distinct words and the most frequent words, both overall and by category.
In the subsequent section, we use publicly available lexicons to attribute “sentiments” to the words of our corpus. We then analyse the amount of positive and negative sentiment in the corpus.
Finally, we fit LDA models and analyze what each topic contains.
Main source: Text Mining with R (Silge & Robinson).
NB: The second project (‘Text Mining’) differs in that it does not simply remove stop words from a fixed list; it uses a frequency-based approach instead. The second project also uses bigrams instead of single words.
First we have to clean and preprocess the data. The following steps are taken:
First we import the data.
## -- Import data --
data = read_csv("news_en.csv")
documents = data[c(1,3,4)]
Then we do some preliminary cleaning. In particular, we keep only the articles with at least 250 words (an arbitrary threshold, and the count is approximate since it simply splits on spaces). We then randomly select a subsample of size \(N\).
# Keep only the articles for which there are at least (approx.) 250 words.
documents = documents %>%
mutate(words = sapply(strsplit(Body, " "), length)) %>%
filter(words >= 250) %>%
select(-words)
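The split-on-spaces count is only approximate (consecutive spaces, for instance, produce empty "words"). A slightly more robust version of the block above (a sketch, assuming the stringr package is available) counts runs of non-whitespace instead:
library(stringr)
# Sketch: count maximal runs of non-whitespace characters instead of splitting on " ".
documents = documents %>%
  mutate(words = str_count(Body, "\\S+")) %>%
  filter(words >= 250) %>%
  select(-words)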
# From this we select a random subsample of size 1000
set.seed(1234)
N = 1000 # number of articles to select
index = seq_len(nrow(documents))
index_sample = sort(sample(index, N, replace = FALSE))
# Select the random articles
documents = documents[index_sample, ]
# Load the list of stop words
data("stop_words")articles = data_frame()
for(i in 1:N){
# Get the article
doc = documents$Body[i]
# Turn it into a string, remove numbers, and put each sentence on its own line
# Each sentence is assumed to be separated by ". "
text = toString(doc) %>%
removeNumbers() %>%
strsplit(". ", fixed = TRUE) %>%
unlist() %>%
tolower()
# Remove Punctuations
text = gsub("[,]", "", text)
text = gsub("[.]", "", text)
text = gsub("[!]", "", text)
text = gsub("[?]", "", text)
text = gsub("[:]", "", text)
text = gsub("’'s", "", text, fixed = TRUE)
text = gsub("reuters", "", text, fixed = TRUE)
# Turn the data frame into a tibble (data_frame), needed for the tidytext steps below
# Also removes the stop words
text_df <- data_frame(line = 1:length(text), text = text) %>%
unnest_tokens(word, text) %>%
anti_join(stop_words)
# Add information about the id (article number) and category, for descriptive stats
text_df$id = rep(documents$id[i], dim(text_df)[1])
text_df$Kat = rep(documents$Kat[i], dim(text_df)[1])
# Put it together to get the articles
articles = rbind(articles, text_df)
}
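Building the corpus row by row with rbind() works, but it is quadratic in the number of articles. As a sketch of a loop-free alternative (note that it skips the per-sentence line index that the loop produces), the whole subsample can be tokenized in one pass:
# Sketch: tokenize all articles at once with tidytext.
# unnest_tokens() lowercases and strips punctuation by itself.
articles_alt = documents %>%
  mutate(Body = removeNumbers(Body)) %>%
  unnest_tokens(word, Body) %>%
  anti_join(stop_words, by = "word") %>%
  select(id, Kat, word)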
head(articles, 5)
## # A tibble: 5 x 4
## line word id Kat
## <int> <chr> <int> <chr>
## 1 1 federal 11 Top-Nachrichten
## 2 1 judge 11 Top-Nachrichten
## 3 1 tuesday 11 Top-Nachrichten
## 4 1 denied 11 Top-Nachrichten
## 5 1 inc’s 11 Top-Nachrichten
We now have a clean data set.
The following descriptive statistics concern the whole corpus.
# Number of distinct words.
nwords = length(unique(articles$word))
nwords
## [1] 27042
# Number of "documents"
ndocs = length(unique(articles$id))
ndocs
## [1] 1000
# Number of articles / proportion by category
articles %>%
distinct(id, Kat) %>%
group_by(Kat) %>%
summarize(
n = n(),
proportion = n/ndocs
) %>%
ungroup()
## # A tibble: 5 x 3
## Kat n proportion
## <chr> <int> <dbl>
## 1 Breakingviews 1 0.001
## 2 Inlandsnachrichten 139 0.139
## 3 Politik 212 0.212
## 4 Top-Nachrichten 272 0.272
## 5 Weltnachrichten 376 0.376
# Average number of words by category
# Note that stop words have been removed and that the initial
# 250-word filter was only approximate.
articles %>%
group_by(Kat, id) %>%
summarize(
n = n()
) %>%
ungroup() %>%
group_by(Kat) %>%
summarize(
averageWords = round(mean(n))
)
## # A tibble: 5 x 2
## Kat averageWords
## <chr> <dbl>
## 1 Breakingviews 236
## 2 Inlandsnachrichten 245
## 3 Politik 294
## 4 Top-Nachrichten 264
## 5 Weltnachrichten 280
And now some basic graphical representations:
# For the two graphs below, adjust the top_n() threshold as needed to keep the plots readable
# Most frequent words
articles %>%
count(word, sort = TRUE) %>%
top_n(10, wt = n) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_bar(stat = "identity") +
xlab(NULL) +
coord_flip() +
theme_bw() +
ggtitle("10 Most Frequent Words (overall)")# Find the top 10 words
top10 = articles %>%
count(word, sort = TRUE) %>%
top_n(10)
# Get the positions to match the previous graph
positions = rev(top10$word)
# Most frequent words by category
articles %>%
group_by(Kat) %>%
count(word, sort = TRUE) %>%
filter(word %in% top10$word) %>%
mutate(word = reorder(word, n)) %>%
ungroup() %>%
ggplot(aes(reorder(word, n), n, fill = Kat)) +
geom_bar(stat = "identity") +
xlab(NULL) +
scale_x_discrete(limits = positions)+
coord_flip() +
theme_bw() +
ggtitle("Most frequent words by category")For simplicity, we only keep words belonging to categories “positive” and “negative”. We start by comparing two different lexicons.
# Compare different lexicon
wordsSentiment = c()
lexicons = c("nrc", "bing")
i = 0
for(lex in lexicons){
i = i + 1
SA = get_sentiments(lex) %>%
filter(sentiment == "positive" | sentiment == "negative")
# Count the number of words found by the lexicon
sent_data = articles %>%
inner_join(SA)
# Proportion of unique words comparison
wordsSentiment[i] = length(unique(sent_data$word)) / length(unique(articles$word))
}
print(wordsSentiment)
## [1] 0.1146735 0.1033208
# Get the most represented lexicon
pos = which(wordsSentiment == max(wordsSentiment))
SA = get_sentiments(lexicons[pos]) %>%
filter(sentiment == "positive" | sentiment == "negative")Note that changing the lexicon has a huge impact on the results. Indeed, “Trump” is not recognised in “nrc”, but it’s a very frequent word overall. Therefore, it changes the rest of the analysis.
Here are some descriptive tables.
# Tables
articles %>%
inner_join(SA) %>%
count(sentiment, sort = TRUE)
## # A tibble: 2 x 2
## sentiment n
## <chr> <int>
## 1 positive 32702
## 2 negative 22232
articles %>%
inner_join(SA) %>%
count(word, sentiment, sort = TRUE)
## # A tibble: 3,151 x 3
## word sentiment n
## <chr> <chr> <int>
## 1 president positive 1380
## 2 united positive 1027
## 3 government negative 985
## 4 police positive 663
## 5 including positive 562
## 6 deal positive 501
## 7 white positive 501
## 8 foreign negative 486
## 9 statement positive 440
## 10 war negative 396
## # ... with 3,141 more rows
Some wordclouds.
# Basic Wordcloud
articles %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))
# Adding sentiments
articles %>%
inner_join(
get_sentiments("nrc") %>% filter(sentiment == "positive" | sentiment == "negative")
) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("red", "darkgreen"),
max.words = 100)
Now we try to fit the model. First, we turn our tidy data into a DocumentTermMatrix object, the input format required to fit an LDA model. Along the way, we attach the sentiments again.
# Get a DocumentTermMatrix object
new_data = articles[c(3,2)]
colnames(new_data) = c("document", "term")
# Here is the required format
new_data = new_data %>%
group_by(document, term) %>%
mutate(count = n()) %>%
distinct() %>%
ungroup()
head(new_data, 5)
## # A tibble: 5 x 3
## document term count
## <int> <chr> <int>
## 1 11 federal 1
## 2 11 judge 2
## 3 11 tuesday 1
## 4 11 denied 1
## 5 11 inc’s 1
# Here is the transformation to get the DocumentTermMatrix object
dtm = new_data %>%
cast_dtm(document, term, count)
dtm## <<DocumentTermMatrix (documents: 1000, terms: 27042)>>
## Non-/sparse entries: 202807/26839193
## Sparsity : 99%
## Maximal term length: 27
## Weighting : term frequency (tf)
inspect(dtm[1:4, 1:18])
## <<DocumentTermMatrix (documents: 4, terms: 18)>>
## Non-/sparse entries: 32/40
## Sparsity : 56%
## Maximal term length: 14
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs denied department donald federal house judge justice president trump
## 11 1 3 1 1 1 2 5 2 3
## 18 0 0 1 1 4 0 1 2 0
## 28 1 0 0 0 0 0 0 3 0
## 48 0 0 1 1 1 0 0 2 3
## Terms
## Docs tuesday
## 11 1
## 18 0
## 28 3
## 48 0
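The matrix is 99% sparse. Fitting LDA on it works, but if speed or memory became a concern, rare terms could be trimmed first; a sketch using tm::removeSparseTerms (not part of the pipeline below):
# Sketch: keep only terms that appear in at least 1% of the documents.
dtm_small = removeSparseTerms(dtm, sparse = 0.99)
dtm_small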
We tidy the DocumentTermMatrix back into a one-token-per-row tibble and again attach the positive and negative sentiments.
# Tidy the DocumentTermMatrix Object
dtm_tidy = tidy(dtm)
dtm_tidy
## # A tibble: 202,807 x 3
## document term count
## <chr> <chr> <dbl>
## 1 11 federal 1
## 2 18 federal 1
## 3 48 federal 1
## 4 75 federal 2
## 5 106 federal 1
## 6 120 federal 1
## 7 447 federal 3
## 8 556 federal 2
## 9 581 federal 4
## 10 597 federal 2
## # ... with 202,797 more rows
# This creates a tibble with the "terms" (= words), the document (= article)
# in which they appear, and how many times each word appears in each document.
# Add the sentiments
dtm_sentiment <- dtm_tidy %>%
inner_join(
get_sentiments("nrc") %>% filter(sentiment == "positive" | sentiment == "negative"),
by = c(term = "word")
)
head(dtm_sentiment, 5)
## # A tibble: 5 x 4
## document term count sentiment
## <chr> <chr> <dbl> <chr>
## 1 11 denied 1 negative
## 2 28 denied 1 negative
## 3 355 denied 1 negative
## 4 702 denied 1 negative
## 5 997 denied 3 negative
Thanks to this step, we can identify the estimated five most negative and five most positive articles and print their headlines. A quick check of the headlines gives a rough sense of the quality of the sentiment scores.
# Top 5 articles most negative sentiments -> with headlines
# Shift the document ids by 1 so they match the ids used earlier in the code
data$id = as.character(data$id + 1)
colnames(data)[1] = "document"
worst5 = dtm_sentiment %>%
count(document, sentiment, wt = count) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative) %>%
arrange(sentiment) %>%
top_n(-5, wt = sentiment) %>%
inner_join(data[,c(1,2,4,5)], by ="document") %>%
arrange(sentiment)
# Top 5 articles most positive sentiments
top5 = dtm_sentiment %>%
count(document, sentiment, wt = count) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative) %>%
arrange(sentiment) %>%
top_n(5, wt = sentiment) %>%
inner_join(data[,c(1,2,4,5)], by ="document") %>%
arrange(desc(sentiment))
# Print the headlines
print("Headlines of most negative articles:")## [1] "Headlines of most negative articles:"
for(i in 1:5){
print(paste(as.character(i), ".", worst5$Headline[i]))
}## [1] "1 . Intel hit with 32 lawsuits over security flaws"
## [1] "2 . Pope appeals to Syria's Assad to respect humanitarian law, let aid in"
## [1] "3 . Oil up 3 percent as Brexit chances dim; gasoline surges too"
## [1] "4 . Iran says we do not want nuclear weapons, there is no sunset clause in nuclear deal"
## [1] "5 . Turkey-EU deal aims to discourage illegal migration, Davutoglu says"
print("Headlines of most positive articles:")## [1] "Headlines of most positive articles:"
for(i in 1:5){
print(paste(as.character(i), ".", top5$Headline[i]))
}## [1] "1 . Cuba's Fidel Castro made revolutionary mark on history"
## [1] "2 . Syria opposition's Hijab says need general ceasefire, talks at dead end"
## [1] "3 . White House adviser says return to Florida Keys may take weeks"
## [1] "4 . Britain ready to take other steps against Russian provocation: May's spokesman"
## [1] "5 . 'The Power of Love': address by U.S. bishop at Harry and Meghan's wedding"
# Contribution to sentiments
dtm_sentiment %>%
count(sentiment, term, wt = count) %>%
filter(n >= N/5) %>%
mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
mutate(term = reorder(term, n)) %>%
ggplot(aes(term, n, fill = sentiment)) +
geom_bar(stat = "identity") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ylab("Contribution to sentiment") # k^ is the number of topics. We can have more but 2 topics is easier to get a representation
# and to understand what's going on
ap_lda <- LDA(dtm, k = 5, control = list(seed = 1234))
ap_lda
## A LDA_VEM topic model with 5 topics.
# Most likely word by topic
get_terms(ap_lda)
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
## "percent" "people" "minister" "trump" "china"
# Most likely topic by document
# get_topics(ap_lda, 1)
hist(get_topics(ap_lda, 1), main = "Histogram of Topics")
ap_topics <- tidy(ap_lda, matrix = "beta")
ap_topics
## # A tibble: 135,210 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 federal 1.23e- 3
## 2 2 federal 3.28e- 4
## 3 3 federal 2.42e- 4
## 4 4 federal 3.40e- 3
## 5 5 federal 9.30e- 4
## 6 1 judge 4.45e- 4
## 7 2 judge 8.64e- 5
## 8 3 judge 4.81e-36
## 9 4 judge 2.17e- 3
## 10 5 judge 2.02e- 5
## # ... with 135,200 more rows
ap_top_terms <- ap_topics %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
# Most frequent words by topics
ap_top_terms %>%
# mutate(term = reorder(term, beta)) %>%
ggplot(aes(x = reorder(term,beta), y = reorder(beta, term), fill = factor(topic))) +
geom_col(show.legend = FALSE) +
theme_bw() +
facet_wrap(~ topic, scales = "free") +
coord_flip()
Some information about the \(\beta\) values (per-topic word probabilities) of the model.
beta_spread <- ap_topics %>%
mutate(topic = paste0("topic", topic)) %>%
spread(topic, beta) %>%
filter(
topic1 > .001 | topic2 > .001 | topic3 > .001 | topic4 > .001 | topic5 > .001
) %>%
mutate(log_ratio = log2(topic2 / topic1)) %>%
arrange(log_ratio)
# beta_spread
# Show words on the boundary between topics 1 and 2
# Note: the log ratio compares only topic 1 and topic 2, so it is really
# interpretable only for K = 2
beta_spread %>%
select(log_ratio, term) %>%
filter(abs(log_ratio) < 0.25) %>%
arrange(log_ratio) %>%
ggplot(aes(x = reorder(term, log_ratio), y = log_ratio)) +
geom_bar(stat = "identity") +
coord_flip()
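The limitation noted in the comments can be sidestepped with a topic-versus-rest comparison, which stays interpretable for any number of topics K. A sketch:
# Sketch: compare each word's beta in a topic to its average beta in the
# remaining topics; a large |log ratio| marks words specific to that topic.
beta_vs_rest <- ap_topics %>%
  group_by(term) %>%
  mutate(rest = (sum(beta) - beta) / (n() - 1)) %>%
  ungroup() %>%
  filter(beta > .001) %>%
  mutate(log_ratio = log2(beta / rest)) %>%
  arrange(desc(log_ratio))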
We can also show the relationship between topics and categories.
# GRAPH = Relationship between Category and Topic
# Source: https://www.datacamp.com/community/tutorials/ML-NLP-lyric-analysis#buildingmodels
category = documents[c(1,3)]
colnames(category) = c("document", "Kat")
category$document = as.character(category$document)
source_topic_relationship <- tidy(ap_lda, matrix = "gamma") %>%
# join to the tidy form to get the category field
inner_join(category, by = "document") %>%
select(Kat, topic, gamma) %>%
group_by(Kat, topic) %>%
# average gamma (document) probability per category/topic
mutate(mean = mean(gamma)) %>%
select(Kat, topic, mean) %>%
ungroup() %>%
#re-label topics
mutate(topic = paste("Topic", topic, sep = " ")) %>%
distinct()
circos.clear() #very important! Reset the circular layout parameters
# this is the long form of grid.col, to make the mapping explicit;
# individual colors could also be assigned to the category names
grid.col = c("Topic 1" = "grey", "Topic 2" = "grey", "Topic 3" = "grey",
"Topic 4" = "grey", "Topic 5" = "grey")
# set the gap size between the top and bottom halves to 15
circos.par(gap.after = c(rep(5, length(unique(source_topic_relationship[[1]])) - 1), 15,
rep(5, length(unique(source_topic_relationship[[2]])) - 1), 15))
chordDiagram(source_topic_relationship, grid.col = grid.col, annotationTrack = "grid",
preAllocateTracks = list(track.height = max(strwidth(unlist(dimnames(source_topic_relationship))))))
#go back to the first track and customize sector labels
#use niceFacing to pivot the label names to be perpendicular
circos.track(track.index = 1, panel.fun = function(x, y) {
circos.text(CELL_META$xcenter, CELL_META$ylim[1], CELL_META$sector.index,
facing = "clockwise", niceFacing = TRUE, adj = c(0, 0.5))
}, bg.border = NA) # setting bg.border to NA is important here
title("Relationship Between Topic and Categories")# k is the number of topics
ap_lda <- LDA(dtm, k = 8, control = list(seed = 1234))
ap_lda
## A LDA_VEM topic model with 8 topics.
# Most likely word by topic
get_terms(ap_lda)
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6
## "company" "people" "minister" "trump" "china" "percent"
## Topic 7 Topic 8
## "percent" "syria"
# Most likely topic by document
# get_topics(ap_lda, 1)
hist(get_topics(ap_lda, 1), main = "Histogram of Topics")
ap_topics <- tidy(ap_lda, matrix = "beta")
ap_topics
## # A tibble: 216,336 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 federal 0.00102
## 2 2 federal 0.000446
## 3 3 federal 0.000280
## 4 4 federal 0.00378
## 5 5 federal 0.0000000685
## 6 6 federal 0.000820
## 7 7 federal 0.00250
## 8 8 federal 0.0000974
## 9 1 judge 0.000473
## 10 2 judge 0.000131
## # ... with 216,326 more rows
ap_top_terms <- ap_topics %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
# Most frequent words by topics
ap_top_terms %>%
mutate(term = reorder(term, beta)) %>% ggplot(aes(x = reorder(term, beta), y = beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
theme_bw() +
facet_wrap(~ topic, scales = "free") +
coord_flip()
Some information about the \(\beta\) values of the model.
beta_spread <- ap_topics %>%
mutate(topic = paste0("topic", topic)) %>%
spread(topic, beta) %>%
filter(
topic1 > .001 | topic2 > .001 | topic3 > .001 | topic4 > .001 | topic5 > .001 | topic6 > .001 | topic7 > .001 | topic8 > .001
) %>%
mutate(log_ratio = log2(topic2 / topic1)) %>%
arrange(log_ratio)
# beta_spread
# Show words on the boundary between topics 1 and 2
# Note: the log ratio compares only topic 1 and topic 2, so it is really
# interpretable only for K = 2
beta_spread %>%
select(log_ratio, term) %>%
filter(abs(log_ratio) < 0.25) %>%
arrange(log_ratio) %>%
ggplot(aes(x = reorder(term, log_ratio), y = log_ratio)) +
geom_bar(stat = "identity") +
coord_flip()
We can also show the relationship between topics and categories.
# GRAPH = Relationship between Category and Topic
# Source: https://www.datacamp.com/community/tutorials/ML-NLP-lyric-analysis#buildingmodels
category = documents[c(1,3)]
colnames(category) = c("document", "Kat")
category$document = as.character(category$document)
source_topic_relationship <- tidy(ap_lda, matrix = "gamma") %>%
# join to the tidy form to get the category field
inner_join(category, by = "document") %>%
select(Kat, topic, gamma) %>%
group_by(Kat, topic) %>%
# average gamma (document) probability per category/topic
mutate(mean = mean(gamma)) %>%
select(Kat, topic, mean) %>%
ungroup() %>%
#re-label topics
mutate(topic = paste("Topic", topic, sep = " ")) %>%
distinct()
circos.clear() #very important! Reset the circular layout parameters
# this is the long form of grid.col, to make the mapping explicit;
# individual colors could also be assigned to the category names
grid.col = c("Topic 1" = "grey", "Topic 2" = "grey", "Topic 3" = "grey",
"Topic 4" = "grey", "Topic 5" = "grey", "Topic 6" = "grey",
"Topic 7" = "grey", "Topic 8" = "grey")
# set the gap size between the top and bottom halves to 15
circos.par(gap.after = c(rep(5, length(unique(source_topic_relationship[[1]])) - 1), 15,
rep(5, length(unique(source_topic_relationship[[2]])) - 1), 15))
chordDiagram(source_topic_relationship, grid.col = grid.col, annotationTrack = "grid",
preAllocateTracks = list(track.height = max(strwidth(unlist(dimnames(source_topic_relationship))))))
#go back to the first track and customize sector labels
#use niceFacing to pivot the label names to be perpendicular
circos.track(track.index = 1, panel.fun = function(x, y) {
circos.text(CELL_META$xcenter, CELL_META$ylim[1], CELL_META$sector.index,
facing = "clockwise", niceFacing = TRUE, adj = c(0, 0.5))
}, bg.border = NA) # setting bg.border to NA is important here
title("Relationship Between Topic and Categories")# k^ is the number of topics. We can have more but 2 topics is easier to get a representation
# and to understand what's going on
ap_lda <- LDA(dtm, k = 12, control = list(seed = 1234))
ap_lda
## A LDA_VEM topic model with 12 topics.
# Most likely word by topic
get_terms(ap_lda)
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6
## "company" "israel" "trade" "trump" "china" "european"
## Topic 7 Topic 8 Topic 9 Topic 10 Topic 11 Topic 12
## "police" "syrian" "percent" "court" "islamic" "percent"
# Most likely topic by document
# get_topics(ap_lda, 1)
hist(get_topics(ap_lda, 1), main = "Histogram of Topics")
ap_topics <- tidy(ap_lda, matrix = "beta")
ap_topics
## # A tibble: 324,504 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 federal 1.79e- 3
## 2 2 federal 7.14e- 5
## 3 3 federal 4.50e- 5
## 4 4 federal 3.77e- 3
## 5 5 federal 3.03e-20
## 6 6 federal 1.63e-10
## 7 7 federal 3.23e- 3
## 8 8 federal 1.26e- 5
## 9 9 federal 1.34e- 3
## 10 10 federal 1.12e- 3
## # ... with 324,494 more rows
ap_top_terms <- ap_topics %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
# Most frequent words by topics
ap_top_terms %>%
mutate(term = reorder(term, beta)) %>% ggplot(aes(x = reorder(term, beta), y = beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
theme_bw() +
facet_wrap(~ topic, scales = "free") +
coord_flip()
Some information about the \(\beta\) values of the model.
beta_spread <- ap_topics %>%
mutate(topic = paste0("topic", topic)) %>%
spread(topic, beta) %>%
filter(
topic1 > .001 | topic2 > .001 | topic3 > .001 | topic4 > .001 | topic5 > .001 | topic6 > .001 | topic7 > .001 | topic8 > .001 | topic9 > .001 | topic10 > .001 | topic11 > .001 | topic12 > 0.01
) %>%
mutate(log_ratio = log2(topic2 / topic1)) %>%
arrange(log_ratio)
# beta_spread
# Show words on the boundary between topics 1 and 2
# Note: the log ratio compares only topic 1 and topic 2, so it is really
# interpretable only for K = 2
beta_spread %>%
select(log_ratio, term) %>%
filter(abs(log_ratio) < 0.25) %>%
arrange(log_ratio) %>%
ggplot(aes(x = reorder(term, log_ratio), y = log_ratio)) +
geom_bar(stat = "identity") +
coord_flip()
We can also show the relationship between topics and categories.
# GRAPH = Relationship between Category and Topic
# Source: https://www.datacamp.com/community/tutorials/ML-NLP-lyric-analysis#buildingmodels
category = documents[c(1,3)]
colnames(category) = c("document", "Kat")
category$document = as.character(category$document)
source_topic_relationship <- tidy(ap_lda, matrix = "gamma") %>%
# join to the tidy form to get the category field
inner_join(category, by = "document") %>%
select(Kat, topic, gamma) %>%
group_by(Kat, topic) %>%
# average gamma (document) probability per category/topic
mutate(mean = mean(gamma)) %>%
select(Kat, topic, mean) %>%
ungroup() %>%
#re-label topics
mutate(topic = paste("Topic", topic, sep = " ")) %>%
distinct()
circos.clear() #very important! Reset the circular layout parameters
# this is the long form of grid.col, to make the mapping explicit;
# individual colors could also be assigned to the category names
grid.col = c("Topic 1" = "grey", "Topic 2" = "grey", "Topic 3" = "grey",
"Topic 4" = "grey", "Topic 5" = "grey", "Topic 6" = "grey",
"Topic 7" = "grey", "Topic 8" = "grey", "Topic 9" = "grey", "Topic 10" = "grey", "Topic 11" = "grey", "Topic 12" = "grey")
# set the gap size between the top and bottom halves to 15
circos.par(gap.after = c(rep(5, length(unique(source_topic_relationship[[1]])) - 1), 15,
rep(5, length(unique(source_topic_relationship[[2]])) - 1), 15))
chordDiagram(source_topic_relationship, grid.col = grid.col, annotationTrack = "grid",
preAllocateTracks = list(track.height = max(strwidth(unlist(dimnames(source_topic_relationship))))))
#go back to the first track and customize sector labels
#use niceFacing to pivot the label names to be perpendicular
circos.track(track.index = 1, panel.fun = function(x, y) {
circos.text(CELL_META$xcenter, CELL_META$ylim[1], CELL_META$sector.index,
facing = "clockwise", niceFacing = TRUE, adj = c(0, 0.5))
}, bg.border = NA) # setting bg.border to NA is important here
title("Relationship Between Topic and Categories")Now we try two different approaches to account for the shortcomings of the previous part.
colnames(data)[1] = "id"
data$id = as.character(data$id)
articles$id = as.character(articles$id)
dat <- articles %>%
left_join(data[, c("id", "Date")]) %>%
select(-line)
total_words <- dat %>%
group_by(Kat) %>%
count(word, sort = T) %>%
mutate(total = sum(n)) %>%
mutate(fraction = n/total)
## Plot the word frequencies
ggplot(total_words, aes(fraction, fill = Kat)) +
geom_histogram(show.legend = FALSE) +
xlim(NA, 0.0009) +
facet_wrap(~Kat, ncol = 2, scales = "free_y")# Term Frequency
freq_by_rank <- total_words %>%
group_by(Kat) %>%
distinct() %>%
arrange(desc(fraction)) %>%
mutate(rank = row_number())
# Zipf's law: frequency falls off roughly as a power of rank
freq_by_rank %>%
ggplot(aes(rank,fraction, color = Kat)) +
geom_line(size = 1, alpha = 0.8, show.legend = FALSE) +
scale_x_log10() +
scale_y_log10() +
labs(title = "Frequency by rank") +
theme_bw()
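The roughly straight line on the log-log scale suggests a power law (Zipf's law). Its exponent can be estimated with a simple regression; a sketch, where the rank window is an arbitrary choice to avoid the extreme ends:
# Sketch: fit log10(fraction) ~ log10(rank) over the middle of the rank range.
zipf_fit <- lm(log10(fraction) ~ log10(rank),
               data = subset(freq_by_rank, rank > 10 & rank < 1000))
coef(zipf_fit) # a slope close to -1 is the classic Zipf exponent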
dat <- total_words %>%
bind_tf_idf(word, Kat, n)
dat3 <- dat %>%
group_by(Kat) %>%
distinct(tf_idf, .keep_all = T) %>%
top_n(n = 10, wt = tf_idf)
dat3 %>%
ggplot(aes(reorder(word, tf_idf), tf_idf, fill = Kat)) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = "tf-idf") +
facet_wrap(~Kat, ncol = 2, scales = "free") +
coord_flip()dat2 <- data[index_sample,]
dat2 <- dat2 %>%
unnest_tokens(bigram, Body, token = "ngrams", n = 2) %>%
select(-Headline)
dat2$bigram <- gsub("[^A-Za-z ]","",dat2$bigram)
dat2 <- dat2 %>%
separate(bigram, c("word1", "word2"), sep = " ")
dat2 <- dat2 %>%
filter(!word1 == "",
!word2 == "")
dat3 <- dat2 %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
##reunite the bigrams after having removed stop words
bigrams_united <- dat3 %>%
unite(bigram, word1, word2, sep = " ")
total_words <- bigrams_united %>%
group_by(Kat) %>%
count(bigram, sort = T) %>%
mutate(total = sum(n)) %>%
mutate(fraction = n/total)
dat3 <- total_words %>%
bind_tf_idf(bigram, Kat, n)
## Plot bigram words
dat3 %>%
arrange(desc(tf_idf)) %>%
mutate(word = factor(bigram, levels = rev(unique(bigram)))) %>%
group_by(Kat) %>%
distinct(tf_idf, .keep_all = T) %>%
top_n(n = 10, wt = tf_idf) %>%
ungroup() %>%
ggplot(aes(reorder(word, tf_idf), tf_idf, fill = Kat)) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = "tf-idf") +
facet_wrap(~Kat, ncol = 2, scales = "free") +
coord_flip()
Check the association of words within the bigrams:
dat3 %>%
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(word1 == "gun") %>%
count(word2, sort = TRUE)
## # A tibble: 27 x 3
## # Groups: Kat [3]
## Kat word2 nn
## <chr> <chr> <int>
## 1 Inlandsnachrichten control 1
## 2 Inlandsnachrichten demonstrators 1
## 3 Inlandsnachrichten enthusiasts 1
## 4 Inlandsnachrichten exports 1
## 5 Inlandsnachrichten king 1
## 6 Inlandsnachrichten laws 1
## 7 Inlandsnachrichten lobby 1
## 8 Inlandsnachrichten maker 1
## 9 Inlandsnachrichten makers 1
## 10 Inlandsnachrichten manufacturers 1
## # ... with 17 more rows
A known issue with word-level sentiment analysis is negation:
negation_words <- c("not", "no", "never", "without")
AFINN <- get_sentiments("afinn")
dat4 <- dat2 %>%
filter(word1 %in% negation_words) %>%
filter(!word2 %in% stop_words$word) %>%
inner_join(AFINN, by = c(word2 = "word")) %>%
count(word1, word2, score, sort = TRUE)
dat4 %>%
mutate(contribution = n * score) %>%
arrange(desc(abs(contribution))) %>%
head(20) %>%
mutate(word2 = reorder(word2, contribution)) %>%
ggplot(aes(word2, n * score, fill = n * score > 0)) +
geom_col(show.legend = FALSE) +
xlab(paste0("Words preceded by negation words")) +
ylab("Sentiment score * number of occurrences") +
coord_flip()
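A simple correction (a sketch, not applied in the rest of the analysis) flips the score of a word whenever it is preceded by one of the negation words:
# Sketch: score the bigrams, flipping the sign when word1 is a negation word.
corrected <- dat2 %>%
  inner_join(AFINN, by = c(word2 = "word")) %>%
  mutate(score = ifelse(word1 %in% negation_words, -score, score))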
dat <- data[1:1000,] %>% unnest_tokens(word, Body)
data(stop_words)
dat <- dat %>% anti_join(stop_words, by = "word") %>%
select(-Headline)
dat$word <- gsub("[^A-Za-z]","",dat$word)
dat <- dat %>%
filter(!word == "")
dat$Kat <- as.factor(dat$Kat)
sentim <- dat %>%
filter(!word %in% negation_words) %>%
inner_join(AFINN, by = c(word = "word")) %>%
count(word, score, sort = TRUE)
sentim <- left_join(dat, sentim, by ="word") %>%
filter(!score == "NA")
total_sentim <- sentim %>%
group_by(Kat) %>%
mutate(total = sum(score)) %>%
mutate(fraction = n/total)
total_sentim <- total_sentim %>%
separate(Date, c("year", "month", "day"), sep = "-") %>%
select(-c("month", "day"))
pirateplot(formula = score ~ Kat + year,
data = total_sentim,
xlab = NULL, ylab = "Sentiment score",
main = "Sentiment score distribution by newspaper and month",
pal = "google",
point.o = .2,
avg.line.o = 1,
theme = 0,
point.pch = 16,
point.cex = 1.5,
jitter.val = .1,
cex.lab = .9, cex.names = .7)
Finally, we plot the bigrams as network maps.
dat5 <- total_words %>%
bind_tf_idf(bigram, Kat, n)
bigram_graph <- dat5 %>%
filter(n > 45) %>%
graph_from_data_frame()
bigram_graph
## IGRAPH 7ff17bb DN-- 30 31 --
## + attr: name (v/c), n (e/n), total (e/n), fraction (e/n), tf
## | (e/n), idf (e/n), tf_idf (e/n)
## + edges from 7ff17bb (vertex names):
## [1] Weltnachrichten->north korea Politik ->white house
## [3] Weltnachrichten->prime minister Politik ->donald trump
## [5] Politik ->president donald Weltnachrichten->told reuters
## [7] Weltnachrichten->north korean Weltnachrichten->united nations
## [9] Weltnachrichten->foreign minister Weltnachrichten->north koreas
## [11] Weltnachrichten->president donald Weltnachrichten->donald trump
## [13] Weltnachrichten->security council Weltnachrichten->south korea
## + ... omitted several edges
set.seed(2017)
ggraph(bigram_graph, layout = "fr") +
geom_edge_link() +
geom_node_point() +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
# xlim(-5,7)+
# ylim(1,10)+
xlab(NULL)+
ylab(NULL)+
theme_bw() +
# theme_bw() must come before theme(), otherwise it would reset the blanked axis text
theme(axis.text.x = element_blank(),
axis.text.y = element_blank())
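As a possible refinement (a sketch, in the style of Text Mining with R), direction arrows and a cleaner theme can be added to the network plot:
# Sketch: directed edges with arrows and a blank theme.
library(grid) # for arrow() and unit()
a <- arrow(type = "closed", length = unit(.1, "inches"))
set.seed(2017)
ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(arrow = a, end_cap = circle(.07, "inches")) +
  geom_node_point(color = "lightblue", size = 3) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()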